Data are available on the Kaggle website: https://www.kaggle.com/code/maryamsayagh1/cardiovascular-disease-prediction/data
library(knitr)
knitr::opts_chunk$set(message = FALSE, warning = FALSE, fig.width = 9, fig.height = 9)
library(devtools)
## Loading required package: usethis
library(rmarkdown)
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.4.0 ✔ purrr 1.0.1
## ✔ tibble 3.1.8 ✔ dplyr 1.1.0
## ✔ tidyr 1.3.0 ✔ stringr 1.5.0
## ✔ readr 2.1.3 ✔ forcats 1.0.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
library(ggplot2)
library(FactoMineR)
library(InformationValue)
library(ISLR)
library(caret)
## Loading required package: lattice
##
## Attaching package: 'caret'
##
## The following objects are masked from 'package:InformationValue':
##
## confusionMatrix, precision, recall, sensitivity, specificity
##
## The following object is masked from 'package:purrr':
##
## lift
# Read the dataset
df <- read_delim("~/Documents/Info_Stats/R/Kaggle/Cardiovascular/cardio_train.csv", delim = ";")
## Rows: 70000 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ";"
## dbl (13): id, age, gender, height, weight, ap_hi, ap_lo, cholesterol, gluc, ...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# df <- read_delim("C:/Documents and Settings/steph/Documents/CardiovascularDisease_LogisticRegression/cardio_train.csv", delim = ";")
df <- as_tibble(df)
df <- df %>% rename_with(toupper)
df
## # A tibble: 70,000 × 13
## ID AGE GENDER HEIGHT WEIGHT AP_HI AP_LO CHOLE…¹ GLUC SMOKE ALCO ACTIVE
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0 18393 2 168 62 110 80 1 1 0 0 1
## 2 1 20228 1 156 85 140 90 3 1 0 0 1
## 3 2 18857 1 165 64 130 70 3 1 0 0 0
## 4 3 17623 2 169 82 150 100 1 1 0 0 1
## 5 4 17474 1 156 56 100 60 1 1 0 0 0
## 6 8 21914 1 151 67 120 80 2 2 0 0 0
## 7 9 22113 1 157 93 130 80 3 1 0 0 1
## 8 12 22584 2 178 95 130 90 3 3 0 0 1
## 9 13 17668 1 158 71 110 70 1 1 0 0 1
## 10 14 19834 1 164 68 110 60 1 1 0 0 0
## # … with 69,990 more rows, 1 more variable: CARDIO <dbl>, and abbreviated
## # variable name ¹CHOLESTEROL
dim(df)
## [1] 70000 13
This work is based on a set of R functions built specifically to fit and assess a logistic regression model. The response variable is CARDIO, which is binary. The predictors are quantitative and categorical.
# Source the R add-ons that I coded for logistic regression in R
# devtools::source_url("https://github.com/stephaneLabs/Logistic_Addons/logisticRegressionAddons_20230209.R")
source("~/Documents/Info_Stats/R/Kaggle/Cardiovascular/logisticRegressionAddons_20230314.R")
##
## Attaching package: 'DescTools'
## The following objects are masked from 'package:caret':
##
## MAE, RMSE
## ResourceSelection 0.3-5 2019-07-22
## Loading required package: Matrix
##
## Attaching package: 'Matrix'
## The following objects are masked from 'package:tidyr':
##
## expand, pack, unpack
##
## Attaching package: 'arules'
## The following object is masked from 'package:dplyr':
##
## recode
## The following objects are masked from 'package:base':
##
## abbreviate, write
## Loading required package: carData
##
## Attaching package: 'car'
## The following object is masked from 'package:arules':
##
## recode
## The following object is masked from 'package:DescTools':
##
## Recode
## The following object is masked from 'package:dplyr':
##
## recode
## The following object is masked from 'package:purrr':
##
## some
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
#source("C:/Documents and Settings/steph/Documents/CardiovascularDisease_LogisticRegression/logisticRegressionAddons_20230314.R")
# Quick format of the data
df <- df %>% mutate(ID = ID %>% as.numeric(),
AGE = round(as.numeric(AGE) / 365, 1),
HEIGHT = HEIGHT %>% as.numeric(),
WEIGHT = WEIGHT %>% as.numeric(),
BMI = WEIGHT / (HEIGHT / 100)^2,
AP_HI = as.numeric(AP_HI) / 10,
AP_LO = as.numeric(AP_LO) / 10,
GENDER = GENDER %>% as.factor(),
CHOLESTEROL = CHOLESTEROL %>% as.factor(),
GLUC = GLUC %>% as.factor(),
SMOKE = SMOKE %>% as.factor(),
ALCO = ALCO %>% as.factor(),
ACTIVE = ACTIVE %>% as.factor(),
CARDIO = CARDIO %>% as.factor()
)
nrow_init <- nrow(df)
We compute here the BMI, the approximate age in years, and we divide the systolic and diastolic blood pressures by 10 to obtain the usual range used when blood pressure is taken at the physician's office.
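As a quick check of these transformations, the first raw record (ID 0, from the tibble printed above) can be worked through by hand:

```r
# First raw record: age 18393 days, height 168 cm, weight 62 kg, ap_hi 110
round(18393 / 365, 1)     # age in years: 50.4
62 / (168 / 100)^2        # BMI: about 21.97 kg/m^2
110 / 10                  # systolic pressure on the divided scale: 11
```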
Observations are considered outliers when they fall outside the interval ]Q1 - 1.5 x IQR; Q3 + 1.5 x IQR[, where Q1 is the first quartile, Q3 the third quartile, and IQR the inter-quartile range.
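The `removeOutliers()` helper comes from the sourced add-on file and is not listed here; as a rough sketch (the function name, arguments, and return structure are assumptions on my part), its Tukey-fences logic can be written in base R as:

```r
# Minimal sketch of IQR-based outlier removal (Tukey fences).
# Assumed to mirror the add-on's removeOutliers(): keep only rows whose
# value lies strictly inside ]Q1 - 1.5*IQR ; Q3 + 1.5*IQR[.
remove_outliers_sketch <- function(df, variable) {
  x    <- df[[variable]]
  q    <- quantile(x, c(0.25, 0.75), na.rm = TRUE)
  iqr  <- q[2] - q[1]
  keep <- x > q[1] - 1.5 * iqr & x < q[2] + 1.5 * iqr
  list(without_df = df[keep, , drop = FALSE],
       n_removed  = sum(!keep, na.rm = TRUE))
}
```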
quantitativeVariableDescription(df = df, variableName = "AGE")
## Length of the variable :
## 70000
## Variable summary :
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 29.60 48.40 54.00 53.34 58.40 65.00
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
df <- removeOutliers(df, variable = "AGE")$without_df
## Number of rows of initial dataframe : 70000
## Number of rows of cleaned dataframe : 69996
## Fraction of rows kept : 1
quantitativeVariableDescription(df = df, variableName = "HEIGHT")
## Length of the variable :
## 69996
## Variable summary :
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 55.0 159.0 165.0 164.4 170.0 250.0
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
quantitativeVariableDescription(df = df, variableName = "WEIGHT")
## Length of the variable :
## 69996
## Variable summary :
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 10.00 65.00 72.00 74.21 82.00 200.00
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
quantitativeVariableDescription(df = df, variableName = "BMI")
## Length of the variable :
## 69996
## Variable summary :
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.472 23.875 26.378 27.557 30.222 298.667
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
df <- removeOutliers(df, variable = "BMI")$without_df
## Number of rows of initial dataframe : 69996
## Number of rows of cleaned dataframe : 68001
## Fraction of rows kept : 0.97
df <- removeOutliers(df, variable = "AP_HI")$without_df
## Number of rows of initial dataframe : 68001
## Number of rows of cleaned dataframe : 65059
## Fraction of rows kept : 0.96
df <- removeOutliers(df, variable = "AP_LO")$without_df
## Number of rows of initial dataframe : 65059
## Number of rows of cleaned dataframe : 62011
## Fraction of rows kept : 0.95
# Systolic blood pressure AP_HI must be greater than diastolic blood pressure AP_LO
df <- df %>% filter(AP_HI > AP_LO)
nrow_clean <- nrow(df)
pct_clean <- ((nrow_init - nrow_clean) / nrow_init) * 100
cat("The cleaning deleted ")
## The cleaning deleted
cat(pct_clean)
## 11.41571
cat("% of the rows of the initial dataframe\n")
## % of the rows of the initial dataframe
# Strange patterns suggest that 0 values stand for NA in some variables
df_quanti <- select(df, c(ID, AGE, HEIGHT, WEIGHT, BMI, AP_HI, AP_LO, CARDIO)) %>% column_to_rownames(var = "ID")
panel.cor <- function(x, y, digits = 2, prefix = "", cex.cor, ...)
{
par(usr = c(0, 1, 0, 1))
r <- abs(cor(x, y))
txt <- format(c(r, 0.123456789), digits = digits)[1]
txt <- paste0(prefix, txt)
if(missing(cex.cor)) cex.cor <- 0.8/strwidth(txt)
text(0.5, 0.5, txt, cex = cex.cor * r)
}
# pairs(df_quanti, upper.panel = panel.cor, col = c("black", "red")[as.numeric(df_quanti$CARDIO)])
quantitativeVariableDescription(df = df, variableName = "AGE")
## Length of the variable :
## 62009
## Variable summary :
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 39.10 48.50 54.00 53.37 58.40 65.00
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
quantitativeVariableDescription(df = df, variableName = "BMI")
## Length of the variable :
## 62009
## Variable summary :
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 14.48 23.88 26.22 27.02 29.75 39.74
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
quantitativeVariableDescription(df = df, variableName = "AP_HI")
## Length of the variable :
## 62009
## Variable summary :
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 9.50 12.00 12.00 12.62 14.00 16.90
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
quantitativeVariableDescription(df = df, variableName = "AP_LO")
## Length of the variable :
## 62009
## Variable summary :
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 6.600 8.000 8.000 8.164 9.000 10.400
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
# PCA of the data; NA values are replaced by the mean of the variable
# Some variables are highly correlated (possible collinearity problems?)
df_PCA <- PCA(X = df_quanti, scale.unit = TRUE, quali.sup = 7, ncp = 6, graph = FALSE)
# The first 4 components explain 95% of the variance
barplot(df_PCA$eig[,2])
kable(df_PCA$eig)
| | eigenvalue | percentage of variance | cumulative percentage of variance |
|---|---|---|---|
| comp 1 | 2.2698131 | 37.8302176 | 37.83022 |
| comp 2 | 1.4042520 | 23.4042000 | 61.23442 |
| comp 3 | 1.1395236 | 18.9920605 | 80.22648 |
| comp 4 | 0.8895977 | 14.8266285 | 95.05311 |
| comp 5 | 0.2930822 | 4.8847028 | 99.93781 |
| comp 6 | 0.0037314 | 0.0621907 | 100.00000 |
# Discrimination between the 2 diagnoses is not very good in the 1st plane of the PCA
plot(df_PCA,axes = c(1,2), choix = "var")
plot(df_PCA,axes = c(1,2), choix = "ind", habillage = 7)
plot(df_PCA,axes = c(1,3), choix = "var")
plot(df_PCA,axes = c(1,3), choix = "ind", habillage = 7)
plot(df_PCA,axes = c(2,3), choix = "var")
plot(df_PCA,axes = c(2,3), choix = "ind", habillage = 7)
We check here the relationship between the log odds of cardiovascular disease and the predictors. A linear relationship has to be observed to use the predictor as quantitative; if the relationship is not linear, a transformation, or a conversion into a factor, has to be done.
To obtain the plotted values, groups are built with a discretization function and the group means are plotted to see if we get a linear shape. The method used for discretization is the “cluster” option of the “discretize” function of the “arules” package, which is based on k-means clustering.
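`logOddsVsQuantitativePredictor()` is defined in the sourced add-on; a condensed base-R sketch of the idea (with `stats::kmeans` standing in for `arules::discretize(method = "cluster")`; the helper name is mine, not the add-on's):

```r
# Sketch: bin the predictor by k-means clustering, then compute the
# empirical log odds of the response within each bin.
log_odds_by_group <- function(x, y01, breaks = 10) {
  grp <- kmeans(x, centers = breaks)$cluster
  agg <- data.frame(x_mean = tapply(x, grp, mean),
                    p      = tapply(y01, grp, mean))
  agg$log_odds <- log(agg$p / (1 - agg$p))
  agg[order(agg$x_mean), ]
}
# A roughly linear x_mean vs log_odds pattern supports keeping the
# predictor quantitative, e.g.:
# plot(log_odds ~ x_mean, data = log_odds_by_group(df$AGE, df$CARDIO == "1"))
```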
# Relationship between the log(odds) and each quantitative predictor (should be linear)
logOddsVsQuantitativePredictor(df = df, binaryResponse = "CARDIO", method = "cluster", breaks = 10, quantitativePredictor = "AGE")
logOddsVsQuantitativePredictor(df = df, binaryResponse = "CARDIO", method = "cluster", breaks = 10, quantitativePredictor = "HEIGHT")
# Problem
logOddsVsQuantitativePredictor(df = df, binaryResponse = "CARDIO", method = "cluster", breaks = 10, quantitativePredictor = "WEIGHT")
# Problem
logOddsVsQuantitativePredictor(df = df, binaryResponse = "CARDIO", method = "cluster", breaks = 10, quantitativePredictor = "BMI")
logOddsVsQuantitativePredictor(df = df, binaryResponse = "CARDIO", method = "cluster", breaks = 10, quantitativePredictor = "AP_HI")
# Acceptable
logOddsVsQuantitativePredictor(df = df, binaryResponse = "CARDIO", method = "cluster", breaks = 10, quantitativePredictor = "AP_LO")
# HEIGHT is discretized
df <- df %>% mutate(HEIGHT_grouped = as.factor(discretize(HEIGHT, method = "frequency", breaks = 3)))
attr(df$HEIGHT_grouped, "discretized:breaks") <- NULL
attr(df$HEIGHT_grouped, "discretized:method") <- NULL
HEIGHT_groupedFreq <- binaryResponseVSCategoricalPredictor(df = df, binaryResponse = "CARDIO",
categoricalPredictor = "HEIGHT_grouped")
# Frequencies
kable(HEIGHT_groupedFreq$frequencies)
| | [120,161) | [161,168) | [168,207] |
|---|---|---|---|
| 0 | 9887 | 9808 | 11844 |
| 1 | 10084 | 9193 | 11193 |
# Proportions
kable(HEIGHT_groupedFreq$proportions)
| | [120,161) | [161,168) | [168,207] | Sum |
|---|---|---|---|---|
| 0 | 15.94 | 15.82 | 19.10 | 50.86 |
| 1 | 16.26 | 14.83 | 18.05 | 49.14 |
| Sum | 32.21 | 30.64 | 37.15 | 100.00 |
# Chi2 test of independence
print(HEIGHT_groupedFreq$independanceChi2)
##
## Pearson's Chi-squared test
##
## data: responseVsPredictorFrequencies
## X-squared = 21.823, df = 2, p-value = 1.825e-05
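The reported statistic can be reproduced from the frequency table with base R's `chisq.test()` (no Yates correction is applied to tables wider than 2 x 2):

```r
# Chi-squared test of independence recomputed from the HEIGHT_grouped
# frequency table above (rows: CARDIO = 0 / 1).
freq <- matrix(c(9887, 10084,    # [120,161)
                 9808,  9193,    # [161,168)
                11844, 11193),   # [168,207]
               nrow = 2)
chisq.test(freq)                 # X-squared ~ 21.8 on df = 2
```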
# Proportions conditioned by outcome
kable(HEIGHT_groupedFreq$ConditionnalProportions1)
| | [120,161) | [161,168) | [168,207] | Sum |
|---|---|---|---|---|
| 0 | 31.35 | 31.10 | 37.55 | 100 |
| 1 | 33.09 | 30.17 | 36.73 | 100 |
# Proportions conditioned by predictor
kable(HEIGHT_groupedFreq$ConditionnalProportions2)
| | [120,161) | [161,168) | [168,207] |
|---|---|---|---|
| 0 | 49.51 | 51.62 | 51.41 |
| 1 | 50.49 | 48.38 | 48.59 |
| Sum | 100.00 | 100.00 | 100.00 |
plotProportionsResponseConditionnedByPredictor(HEIGHT_groupedFreq$ConditionnalProportions2, "HEIGHT")
GENDERFreq <- binaryResponseVSCategoricalPredictor(df = df, binaryResponse = "CARDIO",
categoricalPredictor = "GENDER")
GENDERFreq
## $frequencies
## categoPred
## binaryResp 1 2
## 0 20259 11280
## 1 19546 10924
##
## $proportions
## categoPred
## binaryResp 1 2 Sum
## 0 32.67 18.19 50.86
## 1 31.52 17.62 49.14
## Sum 64.19 35.81 100.00
##
## $independanceChi2
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: responseVsPredictorFrequencies
## X-squared = 0.046658, df = 1, p-value = 0.829
##
##
## $ConditionnalProportions1
## categoPred
## binaryResp 1 2 Sum
## 0 64.23 35.77 100.00
## 1 64.15 35.85 100.00
##
## $ConditionnalProportions2
## categoPred
## binaryResp 1 2
## 0 50.9 50.8
## 1 49.1 49.2
## Sum 100.0 100.0
plotProportionsResponseConditionnedByPredictor(GENDERFreq$ConditionnalProportions2, "GENDER")
CHOLESTEROLFreq <- binaryResponseVSCategoricalPredictor(df = df, binaryResponse = "CARDIO",
categoricalPredictor = "CHOLESTEROL")
CHOLESTEROLFreq
## $frequencies
## categoPred
## binaryResp 1 2 3
## 0 26560 3306 1673
## 1 20385 4802 5283
##
## $proportions
## categoPred
## binaryResp 1 2 3 Sum
## 0 42.83 5.33 2.70 50.86
## 1 32.87 7.74 8.52 49.14
## Sum 75.71 13.08 11.22 100.00
##
## $independanceChi2
##
## Pearson's Chi-squared test
##
## data: responseVsPredictorFrequencies
## X-squared = 2944.2, df = 2, p-value < 2.2e-16
##
##
## $ConditionnalProportions1
## categoPred
## binaryResp 1 2 3 Sum
## 0 84.21 10.48 5.30 100.00
## 1 66.90 15.76 17.34 100.00
##
## $ConditionnalProportions2
## categoPred
## binaryResp 1 2 3
## 0 56.58 40.77 24.05
## 1 43.42 59.23 75.95
## Sum 100.00 100.00 100.00
plotProportionsResponseConditionnedByPredictor(CHOLESTEROLFreq$ConditionnalProportions2, "CHOLESTEROL")
GLUCFreq <- binaryResponseVSCategoricalPredictor(df = df, binaryResponse = "CARDIO",
categoricalPredictor = "GLUC")
GLUCFreq
## $frequencies
## categoPred
## binaryResp 1 2 3
## 0 27900 1834 1805
## 1 25104 2510 2856
##
## $proportions
## categoPred
## binaryResp 1 2 3 Sum
## 0 44.99 2.96 2.91 50.86
## 1 40.48 4.05 4.61 49.14
## Sum 85.48 7.01 7.52 100.00
##
## $independanceChi2
##
## Pearson's Chi-squared test
##
## data: responseVsPredictorFrequencies
## X-squared = 471.39, df = 2, p-value < 2.2e-16
##
##
## $ConditionnalProportions1
## categoPred
## binaryResp 1 2 3 Sum
## 0 88.46 5.82 5.72 100.00
## 1 82.39 8.24 9.37 100.00
##
## $ConditionnalProportions2
## categoPred
## binaryResp 1 2 3
## 0 52.64 42.22 38.73
## 1 47.36 57.78 61.27
## Sum 100.00 100.00 100.00
plotProportionsResponseConditionnedByPredictor(GLUCFreq$ConditionnalProportions2, "GLUC")
SMOKEFreq <- binaryResponseVSCategoricalPredictor(df = df, binaryResponse = "CARDIO",
categoricalPredictor = "SMOKE")
SMOKEFreq
## $frequencies
## categoPred
## binaryResp 0 1
## 0 28575 2964
## 1 27935 2535
##
## $proportions
## categoPred
## binaryResp 0 1 Sum
## 0 46.08 4.78 50.86
## 1 45.05 4.09 49.14
## Sum 91.13 8.87 100.00
##
## $independanceChi2
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: responseVsPredictorFrequencies
## X-squared = 22.161, df = 1, p-value = 2.507e-06
##
##
## $ConditionnalProportions1
## categoPred
## binaryResp 0 1 Sum
## 0 90.60 9.40 100.00
## 1 91.68 8.32 100.00
##
## $ConditionnalProportions2
## categoPred
## binaryResp 0 1
## 0 50.57 53.90
## 1 49.43 46.10
## Sum 100.00 100.00
plotProportionsResponseConditionnedByPredictor(SMOKEFreq$ConditionnalProportions2, "SMOKE")
ALCOFreq <- binaryResponseVSCategoricalPredictor(df = df, binaryResponse = "CARDIO",
categoricalPredictor = "ALCO")
ALCOFreq
## $frequencies
## categoPred
## binaryResp 0 1
## 0 29786 1753
## 1 28942 1528
##
## $proportions
## categoPred
## binaryResp 0 1 Sum
## 0 48.03 2.83 50.86
## 1 46.67 2.46 49.14
## Sum 94.71 5.29 100.00
##
## $independanceChi2
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: responseVsPredictorFrequencies
## X-squared = 9.0248, df = 1, p-value = 0.002663
##
##
## $ConditionnalProportions1
## categoPred
## binaryResp 0 1 Sum
## 0 94.44 5.56 100.00
## 1 94.99 5.01 100.00
##
## $ConditionnalProportions2
## categoPred
## binaryResp 0 1
## 0 50.72 53.43
## 1 49.28 46.57
## Sum 100.00 100.00
plotProportionsResponseConditionnedByPredictor(ALCOFreq$ConditionnalProportions2, "ALCO")
ACTIVEFreq <- binaryResponseVSCategoricalPredictor(df = df, binaryResponse = "CARDIO",
categoricalPredictor = "ACTIVE")
ACTIVEFreq
## $frequencies
## categoPred
## binaryResp 0 1
## 0 5727 25812
## 1 6405 24065
##
## $proportions
## categoPred
## binaryResp 0 1 Sum
## 0 9.24 41.63 50.86
## 1 10.33 38.81 49.14
## Sum 19.56 80.44 100.00
##
## $independanceChi2
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: responseVsPredictorFrequencies
## X-squared = 80.494, df = 1, p-value < 2.2e-16
##
##
## $ConditionnalProportions1
## categoPred
## binaryResp 0 1 Sum
## 0 18.16 81.84 100.00
## 1 21.02 78.98 100.00
##
## $ConditionnalProportions2
## categoPred
## binaryResp 0 1
## 0 47.21 51.75
## 1 52.79 48.25
## Sum 100.00 100.00
plotProportionsResponseConditionnedByPredictor(ACTIVEFreq$ConditionnalProportions2, "ACTIVE")
# Splitting the data into a training and a testing set
# Rename the levels of the factors so that the train function works
make_factor <- function(x){return(factor(x, labels = make.names(levels(x))))}
# train data set
df <- df %>% mutate(CARDIO = make_factor(CARDIO))
df <- df %>% mutate(GENDER = make_factor(GENDER))
df <- df %>% mutate(CHOLESTEROL = make_factor(CHOLESTEROL))
df <- df %>% mutate(GLUC = make_factor(GLUC))
df <- df %>% mutate(SMOKE = make_factor(SMOKE))
df <- df %>% mutate(ALCO = make_factor(ALCO))
df <- df %>% mutate(ACTIVE = make_factor(ACTIVE))
#make this example reproducible
set.seed(1)
#Use 70% of dataset as training set and remaining 30% as testing set
sample_train <- sample(c(TRUE, FALSE), nrow(df), replace = TRUE, prob = c(0.7,0.3))
df_train <- df[sample_train, ]
df_test <- df[!sample_train, ]
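The Bernoulli sampling above only gives an approximate 70/30 split and ignores the class balance; a stratified alternative can be sketched in base R (the helper name is mine; `caret::createDataPartition()` offers the same idea, and caret is already loaded):

```r
# Stratified 70/30 split sketch: sample row indices within each outcome
# class so train and test keep the original class ratio.
stratified_split <- function(data, class_col, p = 0.7, seed = 1) {
  set.seed(seed)
  idx <- unlist(lapply(split(seq_len(nrow(data)), data[[class_col]]),
                       function(i) sample(i, floor(p * length(i)))))
  list(train = data[idx, , drop = FALSE],
       test  = data[-idx, , drop = FALSE])
}
# e.g. parts <- stratified_split(df, "CARDIO")
```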
# Null model
m0 <- glm(CARDIO ~ 1, family=binomial(link = logit),
data = df_train)
AIC(m0)
## [1] 60033.83
# Initial model
m1 <- glm(CARDIO ~ AGE + GENDER + HEIGHT_grouped + WEIGHT + BMI + AP_HI + AP_LO + CHOLESTEROL + GLUC + SMOKE + ALCO + ACTIVE, family = binomial(link = logit), data = df_train)
VIFTable <- plotVIF(m1)
kable(VIFTable)
| rowName | GVIF | Df | GVIF^(1/(2*Df)) |
|---|---|---|---|
| AGE | 1.022810 | 1 | 1.011340 |
| GENDER | 1.489076 | 1 | 1.220277 |
| HEIGHT_grouped | 4.172220 | 2 | 1.429196 |
| WEIGHT | 13.435491 | 1 | 3.665446 |
| BMI | 12.432973 | 1 | 3.526042 |
| AP_HI | 1.759194 | 1 | 1.326346 |
| AP_LO | 1.742250 | 1 | 1.319943 |
| CHOLESTEROL | 1.512578 | 2 | 1.108995 |
| GLUC | 1.496330 | 2 | 1.106004 |
| SMOKE | 1.248375 | 1 | 1.117307 |
| ALCO | 1.144394 | 1 | 1.069764 |
| ACTIVE | 1.002561 | 1 | 1.001280 |
Some collinearity is detected with a VIF value over 10 for BMI and WEIGHT.
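For a single quantitative predictor, the VIF reduces to 1 / (1 - R²), where R² comes from regressing that predictor on all the others; `car::vif()` (presumably what the add-on's `plotVIF()` wraps, given the GVIF columns) generalizes this to factors. A hand-rolled check (the helper name is mine):

```r
# Manual VIF for one quantitative predictor: regress it on the other
# columns of `data` and take 1 / (1 - R^2).
vif_manual <- function(data, target) {
  fit <- lm(reformulate(setdiff(names(data), target), target), data = data)
  1 / (1 - summary(fit)$r.squared)
}
# e.g. vif_manual(df_train[, c("AGE", "WEIGHT", "BMI", "AP_HI", "AP_LO")],
#                 "WEIGHT")   # expected to be large, as in the table above
```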
anova(m1, test = "Chisq")
## Analysis of Deviance Table
##
## Model: binomial, link: logit
##
## Response: CARDIO
##
## Terms added sequentially (first to last)
##
##
## Df Deviance Resid. Df Resid. Dev Pr(>Chi)
## NULL 43313 60032
## AGE 1 2357.1 43312 57675 < 2.2e-16 ***
## GENDER 1 0.1 43311 57675 0.8006
## HEIGHT_grouped 2 1.6 43309 57673 0.4443
## WEIGHT 1 1115.0 43308 56558 < 2.2e-16 ***
## BMI 1 39.7 43307 56518 3.025e-10 ***
## AP_HI 1 6911.4 43306 49607 < 2.2e-16 ***
## AP_LO 1 40.3 43305 49567 2.226e-10 ***
## CHOLESTEROL 2 674.8 43303 48892 < 2.2e-16 ***
## GLUC 2 56.5 43301 48835 5.261e-13 ***
## SMOKE 1 30.9 43300 48804 2.721e-08 ***
## ALCO 1 34.2 43299 48770 4.992e-09 ***
## ACTIVE 1 72.5 43298 48698 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
AIC(m1)
## [1] 48729.71
print(computeR2s(m1, m0))
## mcFadden_R2 adjMcFadden_R2 sas_R2 adjSas_R2 dev_R2
## 1 0.1888018 0.1882958 0.2302376 0.3070168 0.1888018
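`computeR2s()` is an add-on function; McFadden's pseudo-R², at least, is easy to recompute from the two log-likelihoods (sketch):

```r
# McFadden's pseudo-R^2: 1 - logLik(model) / logLik(null model).
mcfadden_r2 <- function(model, null_model) {
  1 - as.numeric(logLik(model)) / as.numeric(logLik(null_model))
}
# mcfadden_r2(m1, m0) should match the mcFadden_R2 column (~0.1888); the
# deviance-based variant agrees: 1 - 48698 / 60032 is also ~0.1888.
```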
Cstat(m1)
## [1] 0.7860068
HoslemTest(m1)
##
## Hosmer and Lemeshow goodness of fit (GOF) test
##
## data: model$y, fitted(model)
## X-squared = 318.9, df = 8, p-value < 2.2e-16
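`HoslemTest()` presumably wraps `ResourceSelection::hoslem.test()` (that package is loaded by the add-on file). The statistic itself can be sketched by hand, assuming g risk groups built from quantiles of the fitted probabilities; under H0 it is approximately chi-squared with g - 2 df, hence the df = 8 above:

```r
# Hand-rolled Hosmer-Lemeshow statistic: compare observed vs expected
# event counts within g groups of predicted risk.
hoslem_sketch <- function(y, p, g = 10) {
  grp <- cut(p, quantile(p, seq(0, 1, length.out = g + 1)),
             include.lowest = TRUE)
  obs <- tapply(y, grp, sum)      # observed events per risk group
  exp <- tapply(p, grp, sum)      # expected events per risk group
  n   <- tapply(y, grp, length)
  c(statistic = sum((obs - exp)^2 / (exp * (1 - exp / n))), df = g - 2)
}
```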
residualPlots(m1, binaryresponse = "CARDIO")
Backward elimination is regarded as the least bad method for variable selection, so let us try it. We discard WEIGHT, not because of its p-value, but because of its collinearity with BMI; BMI seems a better compromise between WEIGHT and HEIGHT.
m <- m1
m <- update(m, . ~ . - WEIGHT)
plotVIF(m)
## rowName GVIF Df GVIF^(1/(2*Df))
## 1 AGE 1.019688 1 1.009796
## 2 GENDER 1.426185 1 1.194230
## 3 HEIGHT_grouped 1.348171 2 1.077547
## 4 BMI 1.082027 1 1.040205
## 5 AP_HI 1.759202 1 1.326349
## 6 AP_LO 1.741904 1 1.319812
## 7 CHOLESTEROL 1.512485 2 1.108978
## 8 GLUC 1.496195 2 1.105980
## 9 SMOKE 1.247997 1 1.117138
## 10 ALCO 1.144285 1 1.069713
## 11 ACTIVE 1.002320 1 1.001159
All VIF values are now below 2.4, so collinearity is no longer a concern.
anova(m, test = "Chisq")
## Analysis of Deviance Table
##
## Model: binomial, link: logit
##
## Response: CARDIO
##
## Terms added sequentially (first to last)
##
##
## Df Deviance Resid. Df Resid. Dev Pr(>Chi)
## NULL 43313 60032
## AGE 1 2357.1 43312 57675 < 2.2e-16 ***
## GENDER 1 0.1 43311 57675 0.8006
## HEIGHT_grouped 2 1.6 43309 57673 0.4443
## BMI 1 1142.6 43308 56530 < 2.2e-16 ***
## AP_HI 1 6920.1 43307 49610 < 2.2e-16 ***
## AP_LO 1 40.6 43306 49570 1.872e-10 ***
## CHOLESTEROL 2 675.7 43304 48894 < 2.2e-16 ***
## GLUC 2 56.5 43302 48837 5.366e-13 ***
## SMOKE 1 30.5 43301 48807 3.260e-08 ***
## ALCO 1 34.0 43300 48773 5.513e-09 ***
## ACTIVE 1 73.0 43299 48700 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
We could discard GENDER, considering it is not a significant factor for this data, but it could be useful to review some literature before discarding it definitively.
AIC(m)
## [1] 48729.92
print(computeR2s(m, m0))
## mcFadden_R2 adjMcFadden_R2 sas_R2 adjSas_R2 dev_R2
## 1 0.1887651 0.1882924 0.2301984 0.3069645 0.1887651
Cstat(m)
## [1] 0.7859752
HoslemTest(m)
##
## Hosmer and Lemeshow goodness of fit (GOF) test
##
## data: model$y, fitted(model)
## X-squared = 321.35, df = 8, p-value < 2.2e-16
residualPlots(m, binaryresponse = "CARDIO")
Odds ratio values with their confidence intervals
oddsRatios <- exp(coef(m))
oddsRatios <- data.frame(oddsRatios = oddsRatios) %>% rownames_to_column(var = "rowname")
oddsRatiosCI <- as.data.frame(exp(confint(m))) %>% rownames_to_column(var = "rowname")
## Waiting for profiling to be done...
oddsRatios <- full_join(oddsRatios, oddsRatiosCI)
## Joining with `by = join_by(rowname)`
names(oddsRatios) <- c("rowname", "oddsRatios", "Lower2.5pct", "Upper97.5pct")
kable(oddsRatios)
| rowname | oddsRatios | Lower2.5pct | Upper97.5pct |
|---|---|---|---|
| (Intercept) | 0.0000028 | 0.0000020 | 0.0000039 |
| AGE | 1.0516232 | 1.0481157 | 1.0551504 |
| GENDERX2 | 0.9436157 | 0.8942425 | 0.9956903 |
| HEIGHT_grouped[161,168) | 1.0707462 | 1.0127119 | 1.1321311 |
| HEIGHT_grouped[168,207] | 1.1353205 | 1.0692226 | 1.2055453 |
| BMI | 1.0325169 | 1.0271331 | 1.0379333 |
| AP_HI | 1.9255924 | 1.8775234 | 1.9752699 |
| AP_LO | 1.1317975 | 1.0852240 | 1.1803339 |
| CHOLESTEROLX2 | 1.4561624 | 1.3605421 | 1.5586241 |
| CHOLESTEROLX3 | 3.0670461 | 2.8058635 | 3.3552141 |
| GLUCX2 | 1.0112132 | 0.9237890 | 1.1070037 |
| GLUCX3 | 0.6807467 | 0.6171558 | 0.7506783 |
| SMOKEX1 | 0.8653778 | 0.7940137 | 0.9430346 |
| ALCOX1 | 0.7385722 | 0.6644722 | 0.8207208 |
| ACTIVEX1 | 0.7902923 | 0.7487356 | 0.8341414 |
Odds ratio values with their confidence intervals (graph)
plotOddsRatios(oddsRatiosDf = oddsRatios)
Looking at the odds ratio for ACTIVE, level 1 has odds about 20.97% lower than level 0. This is counter-intuitive, but level 1 of ALCO and of SMOKE also shows lower odds than level 0 (about 26.14% lower for ALCO and 13.46% lower for SMOKE). Is this a problem related to the coding of the levels? AP_HI has a strong influence, as the odds of the disease increase sharply (by 92.56%) with each one-point increase of AP_HI. The same holds for AP_LO, though less markedly (about 13.18% per unit of AP_LO). BMI increases the odds by 3.25% per unit of BMI. CHOLESTEROL level 2 has odds 45.62% higher than level 1 and, even worse, CHOLESTEROL level 3 increases the odds by 206.70% compared to level 1. GENDER seems to have little influence in these data, with level 2 having odds 5.64% lower than level 1. GLUC level 3 has odds 31.93% lower than GLUC level 1, while HEIGHT increases the odds by 7.07% for level [161,168) compared to the reference level (under 161 cm), and level [168,207] shows odds 13.53% greater than those of the reference level.
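The percentages quoted above follow directly from the odds ratios as (OR - 1) x 100:

```r
# Percent change in the odds implied by an odds ratio.
or_to_pct <- function(or) (or - 1) * 100
or_to_pct(0.7902923)   # ACTIVE X1 vs X0:      about -21.0 %
or_to_pct(1.9255924)   # AP_HI, per unit:      about +92.6 %
or_to_pct(3.0670461)   # CHOLESTEROL X3 vs X1: about +206.7 %
```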
model <- m
resultsROC <- rocCurve(model, df_train = df_train, df_test = df_test, outcomeVariable = "CARDIO")
## Training data set
## -----------------
## Dimensions of the training data set : 43314 15
## length of the outcome variable : 43314
## length of the predicted variable : 43314
## Train data set
## Optimization method : Youden metric
## Testing data set
## -----------------
## Dimensions of the testing data set : 18695 15
## length of the outcome variable : 18695
## length of the predicted variable : 18695
## Test data set
## Optimization method : Youden metric
## Setting levels: control = X0, case = X1
## Setting direction: controls < cases
## Setting levels: control = X0, case = X1
## Setting direction: controls < cases
## Setting levels: control = X0, case = X1
## Setting direction: controls < cases
## Setting levels: control = X0, case = X1
## Setting direction: controls < cases
kable(resultsROC$optim_train)
| | FPR | TPR | MCER | threshold | YoudenMetric | TLCMetric |
|---|---|---|---|---|---|---|
| 46 | 0.2868899 | 0.7347605 | 0.2758692 | 0.45 | 0.4478707 | 0.3907144 |
kable(resultsROC$optim_test)
| | FPR | TPR | MCER | threshold | YoudenMetric | TLCMetric |
|---|---|---|---|---|---|---|
| 48 | 0.3083442 | 0.7543989 | 0.276491 | 0.47 | 0.4460547 | 0.394203 |
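For reference, the YoudenMetric column is Youden's J statistic, sensitivity + specificity - 1, i.e. TPR - FPR, which the threshold search maximizes; the reported optima can be checked directly:

```r
# Youden's J = TPR - FPR; the ROC threshold search picks the cut-off
# maximizing it.
youden <- function(tpr, fpr) tpr - fpr
youden(0.7347605, 0.2868899)   # training optimum, about 0.4479
youden(0.7543989, 0.3083442)   # test optimum, about 0.4461
```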